Coronavirus disease (COVID-19) is a pandemic infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). COVID-19 has affected 212 countries and territories around the world. The virus and the disease were unknown before the outbreak began in Wuhan, China, in December 2019. The time between exposure to COVID-19 and the onset of symptoms is commonly around five to six days but can range from 1 to 14 days. I have gathered COVID-19 data from different sources. The dataset contains confirmed cases, deaths, recovered cases, and new and active cases for the world, for India, and for each Indian state/union territory separately.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from timeit import default_timer as timer
import requests
import json
import matplotlib.ticker as ticker
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotly.express as px
import datetime
from plotly.subplots import make_subplots
from bs4 import BeautifulSoup
import plotly.io as pio
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
pio.renderers.default='notebook'
import plotly.graph_objects as go
1. Analysis of all countries: data scraped with Beautiful Soup from https://www.worldometers.info/coronavirus/
2. COVID-19 analysis of India: 'covid_19_india.csv'
3. Analysis based on age: 'AgeGroupDetails.csv'
4. Analysis based on ICMR testing labs: 'ICMRTestingDetails.csv'
5. Analysis based on testing done in Indian states: 'StatewiseTestingDetails.csv'
Source: https://www.kaggle.com/sudalairajkumar/covid19-in-india
6. Analysis based on the time series of confirmed cases worldwide: 'time_series_covid19_confirmed.csv'
7. Analysis based on the time series of recovered cases worldwide: 'time_series_covid19_recovered.csv'
8. Analysis based on the time series of deaths worldwide: 'time_series_covid19_deaths.csv'
Source: https://data.humdata.org/dataset/5dff64bc-a671-48da-aa87-2ca40d7abf02
* Dataset columns:
- Total_cases
- Total_deaths
- Total_Recovered
- New_cases
- New Deaths
- Active_Cases
- Serious_Case
- TotCases/1Mpop
- Deaths/1Mpop
- Total Tests
- Test/1Mpop
from io import StringIO

url = "https://www.worldometers.info/coronavirus/"
try:
    req_data = requests.get(url)
    req_data.raise_for_status()  # raise an HTTPError on a 4xx/5xx response
except requests.exceptions.RequestException as e:
    print("Error: Invalid Response Error.", e)
    raise
soup = BeautifulSoup(req_data.text, 'html.parser')
table = soup.find('table', attrs={'id': 'main_table_countries_today'})
header = [col_name.text.rstrip('\n').strip() for col_name in table.select('thead th')]
table_rows = table.find_all('tr')
# Collect the raw cell text of every row (kept for inspection)
data = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.text for cell in td]
    data.append(row)
# Parse the whole table once into a dataframe
full_data = pd.read_html(StringIO(str(table)))[0]
full_data.head(10)
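The row-by-row extraction above can be illustrated without any network access. The following is a minimal, dependency-free sketch that uses the standard library's `html.parser` in place of Beautiful Soup; the `TableRows` class and the `sample_html` snippet (with made-up numbers) are assumptions for illustration, not part of the scraped page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical miniature of the table, with made-up values
sample_html = """
<table id="main_table_countries_today">
  <thead><tr><th>Country,Other</th><th>TotalCases</th></tr></thead>
  <tbody>
    <tr><td>A</td><td>1,200</td></tr>
    <tr><td>B</td><td>35</td></tr>
  </tbody>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *body = parser.rows
print(header)
print(body)
```

Note that the numbers arrive as strings with thousands separators, which is why the cleaning steps later in the notebook are needed.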
covid_india = pd.read_csv('covid_19_india.csv')
covid_india.head(5)
On 30 January 2020, India reported its first case, in Kerala.
On 2 March 2020, India reported COVID-19 cases in Telangana and Delhi; after that, cases were reported in other states and union territories.
On 13 March 2020, Karnataka reported the death of one patient due to COVID-19.
covid_india_age = pd.read_csv('AgeGroupDetails.csv')
covid_india_age
covid_india_testing = pd.read_csv('ICMRTestingDetails.csv')
covid_india_testing.tail(5)
covid_india_state_testing = pd.read_csv('StatewiseTestingDetails.csv')
covid_india_state_testing
-The time series data contains date and country information.
covid_time_series_C= pd.read_csv('time_series_covid19_confirmed.csv')
covid_time_series_C
covid_time_series_covid_19_R = pd.read_csv('time_series_covid19_recovered.csv')
covid_time_series_covid_19_R
covid_time_series_D = pd.read_csv('time_series_covid19_deaths.csv')
covid_time_series_D
**Assessing full_data: COVID-19 data for all countries
full_data.info()
At the time of writing this report:
216 countries in total, of which 177 reported deaths.
TotalRecovered cases are reported for 207 countries.
NewDeaths has 82 reported values.
Testing is reported for 177 countries.
** Some data is missing. The values may truly be absent, or confirmed cases, deaths, and recovered cases may simply not have been reported. We cannot analyse such cases.
***Rename the columns in the full_data dataframe.
full_data.rename(columns = lambda X:X.strip().lower().replace(" ","_"),inplace =True)
full_data.rename(columns = lambda X:X.strip().lower().replace("country,other","country"),inplace =True)
full_data.rename(columns = lambda X:X.strip().lower().replace("totalcases","total_cases"),inplace =True)
full_data.rename(columns = lambda X:X.strip().lower().replace("newcases","new_cases"),inplace =True)
full_data.rename(columns = lambda X:X.strip().lower().replace("totaldeaths","total_deaths"),inplace =True)
full_data.rename(columns = lambda X:X.strip().lower().replace("newdeaths","new_deaths"),inplace =True)
full_data.rename(columns = lambda X:X.strip().lower().replace("totalrecovered","total_recovered"),inplace =True)
full_data.rename(columns = lambda X:X.strip().lower().replace("activecases","active_cases"),inplace =True)
full_data.rename(columns = lambda X:X.strip().lower().replace("serious,critical","serious"),inplace =True)
full_data.rename(columns = lambda X:X.strip().lower().replace("totaltests","total_tests"),inplace =True)
full_data.rename(columns = lambda X:X.strip().lower().replace("tot cases/1m_pop","totcases/1m_pop"),inplace =True)
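The chain of `rename` calls above can also be expressed as a single pass. The sketch below is one possible way to do that, assuming a hypothetical subset of the scraped header names; `rename_map` and `clean` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical subset of the scraped header names
df = pd.DataFrame(columns=["Country,Other", "TotalCases", "NewCases",
                           "Serious,Critical", "TotalTests", "Tests/ 1M pop"])

# Explicit mapping from normalised names to the final names
rename_map = {
    "country,other": "country",
    "totalcases": "total_cases",
    "newcases": "new_cases",
    "serious,critical": "serious",
    "totaltests": "total_tests",
}


def clean(name):
    # Normalise case and spaces first, then apply the explicit mapping
    key = name.strip().lower().replace(" ", "_")
    return rename_map.get(key, key)


df = df.rename(columns=clean)
print(list(df.columns))
```

Keeping the mapping in one dictionary makes it easier to see which names are changed and to spot a header that was missed.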
full_data.fillna(0, inplace=True)
full_data['total_deaths'] = full_data['total_deaths'].astype('int64')
full_data['total_recovered'] = full_data['total_recovered'].astype('int64')
full_data['serious'] = full_data['serious'].astype('int64')
full_data['deaths/1m_pop'] = full_data['deaths/1m_pop'].astype('int64')
full_data['total_tests'] = full_data['total_tests'].astype('int64')
full_data['tests/_1m_pop'] = full_data['tests/_1m_pop'].astype('int64')
full_data['new_cases'] = full_data['new_cases'].str.replace(',', '', regex=True)
full_data['new_deaths'] = full_data['new_deaths'].str.replace(',','', regex=True)
full_data['new_cases'] = full_data['new_cases'].fillna(0)
full_data['new_deaths'] = full_data['new_deaths'].fillna(0)
full_data['new_cases'] = full_data['new_cases'].astype('int64')
full_data['new_deaths'] = full_data['new_deaths'].astype('int64')
full_data
#full_data['new_cases'] = full_data['new_cases'].Str.replace('+','', regex=True)
#full_data['new_deaths'] = full_data['new_deaths'].str.replace('+','', regex=True)
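The cleaning of the new_cases and new_deaths columns can be sketched on a small series. This is a minimal example with made-up values in the same shape as the scraped strings (a leading `+` and thousands separators); it uses `pd.to_numeric` as one possible alternative to `astype`.

```python
import pandas as pd

# Made-up values shaped like the scraped new_cases column
raw = pd.Series(["+1,234", "+56", None, "7"])

# Strip the plus sign and the thousands separators in one pass
cleaned = raw.str.replace(r"[+,]", "", regex=True)

# Anything that still cannot be parsed becomes NaN, then 0
new_cases = pd.to_numeric(cleaned, errors="coerce").fillna(0).astype("int64")
print(new_cases.tolist())
```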
full_data.drop(full_data.tail(1).index,inplace=True)
full_data.drop(full_data.head(1).index,inplace=True)
full_data.describe()
full_data.sum()
full_data.total_cases.max()
full_data.total_deaths.max()
full_data.total_recovered.max()
full_data.duplicated().sum()
full_data.isna().sum()
full_data['country'].value_counts()
full_data.country.nunique()
Some latitude and longitude data is missing in countries_data.
countries = pd.read_csv('countries_data.csv', encoding= 'unicode_escape')
countries
def getLat(country):
    row = countries.loc[countries['name'] == country]
    return row.latitude.values

def getLong(country):
    row = countries.loc[countries['name'] == country]
    return row.longitude.values
full_data['lat'] =full_data.apply(lambda row: getLat(row['country']), axis=1)
full_data['lat'] = full_data['lat'].str.get(0)
full_data['long'] =full_data.apply(lambda row: getLong(row['country']), axis=1)
full_data['long']= full_data['long'].str.get(0)
full_data['long'] = full_data['long'].astype('float')
full_data.head()
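The coordinate lookup above runs one search per row. A left merge gives the same kind of result in one step and makes the countries without coordinates easy to list. The sketch below assumes two small hypothetical frames standing in for full_data and countries_data.csv; the coordinate values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for full_data and countries_data.csv
cases = pd.DataFrame({"country": ["India", "Italy", "Atlantis"],
                      "total_cases": [100, 200, 5]})
coords = pd.DataFrame({"name": ["India", "Italy"],
                       "latitude": [20.59, 41.87],
                       "longitude": [78.96, 12.57]})

merged = cases.merge(
    coords.rename(columns={"name": "country",
                           "latitude": "lat",
                           "longitude": "long"}),
    on="country",
    how="left",  # keep every country, even one without coordinates
)

# Countries whose names did not match any row in the coordinates table
missing = merged.loc[merged["lat"].isna(), "country"].tolist()
print(merged)
print("No coordinates for:", missing)
```

The `missing` list shows which country names would need to be fixed by hand in the CSV file.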
The country names in the latitude and longitude data were fixed manually in the CSV file. The latitude and longitude data and the full_data dataframe from Worldometer were combined into one table and saved to a new CSV file.
full_data.to_csv('covid.csv')
#full_data['lat'] = full_data.apply(lambda row: getLatL(row['country']), axis=1)
#full_data['long'] = full_data.apply(lambda row: getLong(row['country']), axis=1)
#full_data.info()
** Assessing the COVID-19 India data.
covid_india.info()
covid_india.describe()
#covid_india['State/UnionTerritory'].value_counts()
covid_india.duplicated().sum()
covid_india.isna().sum()
covid_india.Deaths.max()
covid_india['State/UnionTerritory'].unique()
covid_india_state_testing.sum()
covid_india_state_testing.duplicated().sum()
covid_india_state_testing.isna().sum()
covid_india_state_testing.describe()
**Assessing the COVID-19 time_series data for all countries.
covid_time_series_C.info()
covid_time_series_covid_19_R.info()
covid_time_series_D.info()
covid_time_series_C.isna().sum()
covid_time_series_covid_19_R.isna().sum()
covid_time_series_D.isna().sum()
covid_time_series_C.describe()
covid_time_series_covid_19_R.describe()
covid_time_series_D.describe()
***Drop the [Sno] column from the covid_india_age dataframe.
covid_india_age = covid_india_age.drop(['Sno'], axis=1)
covid_india_age
*** Missing values (NaN) in the three covid_time_series dataframes: Confirmed, Recovered, Deaths.
covid_time_series_covid_19_R.fillna(0, inplace=True)
covid_time_series_C.fillna(0,inplace = True)
covid_time_series_D.fillna(0,inplace =True)
***Drop the [Lat, Long] columns from the COVID time_series dataframes; they are irrelevant to this analysis.
#covid_time_series_covid_19_R = covid_time_series_covid_19_R.drop(['Province/State','Lat','Long'], axis=1)
covid_time_series_covid_19_R = covid_time_series_covid_19_R.drop(['Lat','Long'], axis=1)
covid_time_series_covid_19_R
covid_time_series_C =covid_time_series_C.drop(['Lat','Long'], axis=1)
#covid_time_series_C =covid_time_series_C.drop(['Province/State','Lat','Long'], axis=1)
#covid_time_series_C
covid_time_series_D = covid_time_series_D.drop(['Lat','Long'], axis=1)
#covid_time_series_D = covid_time_series_D.drop(['Province/State','Lat','Long'], axis=1)
#covid_time_series_D
** The time_series data is difficult to analyse as provided.
The time series data stores each day's values in a separate column, which is not convenient for analysis. First we unpivot the date columns (columns[3:]) into a variable column 'Date' and a value column 'Confirmed', 'Recovered' or 'Deaths'.
dates = covid_time_series_C.columns[3:]
covid_time_series_C = covid_time_series_C.melt(
id_vars=['Country/Region','Province/State'],
value_vars=dates,
var_name='Date',
value_name='Confirmed'
)
covid_time_series_C
covid_time_series_C = covid_time_series_C.groupby(['Country/Region', 'Date'], as_index=False)['Confirmed'].sum()
dates = covid_time_series_D.columns[3:]
covid_time_series_D = covid_time_series_D.melt(
id_vars=['Country/Region','Province/State'],
value_vars=dates,
var_name='Date',
value_name='Deaths'
)
covid_time_series_D = covid_time_series_D.groupby(['Country/Region', 'Date'], as_index=False)['Deaths'].sum()
covid_time_series_D
dates = covid_time_series_covid_19_R.columns[3:]
covid_time_series_covid_19_R= covid_time_series_covid_19_R.melt(
id_vars=['Country/Region','Province/State'],
value_vars=dates,
var_name='Date',
value_name='Recovered'
)
The world data also contains provincial data for several countries. Country data is grouped and aggregated using groupby.
By default, the result of a groupby aggregation is indexed by the grouping keys (country and date). I pass as_index=False so that the result is a regular dataframe with the keys as columns.
covid_time_series_covid_19_R = covid_time_series_covid_19_R.groupby(['Country/Region', 'Date'], as_index=False)['Recovered'].sum()
covid_time_series_covid_19_R
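The unpivot and aggregation steps can be checked on a toy table. The sketch below assumes a hypothetical wide frame with two provinces of country X and one row for country Y; the values are made up.

```python
import pandas as pd

# Hypothetical wide table: one column per date, one row per province
wide = pd.DataFrame({
    "Province/State": ["P1", "P2", None],
    "Country/Region": ["X", "X", "Y"],
    "1/22/20": [1, 2, 0],
    "1/23/20": [3, 4, 5],
})

# Unpivot: every date column becomes a (Date, Confirmed) pair
long = wide.melt(
    id_vars=["Country/Region", "Province/State"],
    var_name="Date",
    value_name="Confirmed",
)

# Collapse the provinces into one row per country and date
per_country = long.groupby(["Country/Region", "Date"], as_index=False)["Confirmed"].sum()
print(per_country)
```

Country X ends up with the sum of its two provinces for each date, while country Y keeps its single row per date.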
*Merge covid_time_series_C, covid_time_series_D and covid_time_series_covid_19_R using the merge function.
*Calculate Death_percentage and Recovered_percentage in the covid_time_series dataframe.
covid_time_series= covid_time_series_C.merge(right=covid_time_series_D, how='left',on=['Country/Region', 'Date'])
covid_time_series = covid_time_series.merge( right=covid_time_series_covid_19_R, how='left',on=['Country/Region', 'Date'])
# A percentage is the ratio multiplied by 100 (not divided by 100)
covid_time_series['Death_percentage'] = (covid_time_series.Deaths / covid_time_series.Confirmed) * 100
covid_time_series['Recovered_percentage'] = (covid_time_series.Recovered / covid_time_series.Confirmed) * 100
covid_time_series.duplicated().sum()
covid_time_series = covid_time_series.drop_duplicates()
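Rows with zero confirmed cases make the percentage undefined (0/0). The sketch below shows one way to handle that explicitly, assuming a small made-up frame; replacing the zero denominator with NaN is an illustrative choice, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Made-up counts; the first row has no confirmed cases yet
ts = pd.DataFrame({"Confirmed": [0, 50, 200],
                   "Deaths": [0, 1, 10],
                   "Recovered": [0, 10, 150]})

# A zero denominator becomes NaN so the ratio is NaN, which is then set to 0
denom = ts["Confirmed"].replace(0, np.nan)
ts["Death_percentage"] = (ts["Deaths"] / denom * 100).fillna(0)
ts["Recovered_percentage"] = (ts["Recovered"] / denom * 100).fillna(0)
print(ts)
```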
Cases started appearing in different countries at different times, so comparing countries on a particular calendar date is not a fair comparison when we want to analyse the growth rate of the outbreak. I calculated the date of the first coronavirus-positive case in each country and added it as a column in a dataframe. Using this date, I assigned to each row the number of days elapsed since the country's first case. Now we can compare different countries based on days since the first case.
#covid_time_seriesChina = covid_time_series[covid_time_series['Country/Region'] == 'China']
#covid_time_seriesChina.head(50)
covid_time_series
def getfirst_iterrows_loop(df):
    for index, row in df.iterrows():
        if (row['Confirmed'] == 1):
            return row['Date']
    return None
# Select columns with a list; selecting with a bare tuple is no longer supported by pandas
df3 = covid_time_series.groupby(['Country/Region'])[['Confirmed', 'Date']].apply(getfirst_iterrows_loop).reset_index()
df3.info()
df3.rename(columns={ df3.columns[1]: "Day" }, inplace = True)
from datetime import datetime
def setDay(record):
    days = 0
    for index, row in df3.iterrows():
        # Skip countries for which no first-case date was found
        if row['Country/Region'] == record['Country/Region'] and pd.notna(row['Day']):
            delta = datetime.strptime(record['Date'], '%m/%d/%y').date() - datetime.strptime(row['Day'], '%m/%d/%y').date()
            days = delta.days
    return 0 if days < 0 else days
covid_time_series.info()
covid_time_series['Day'] =covid_time_series.apply(lambda row: setDay(row), axis=1)
covid_time_series['Day']
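The days-since-first-case calculation can also be written without row loops. The sketch below is a vectorised variant on a made-up frame. It assumes the first case is the earliest date with Confirmed greater than zero, which differs slightly from the check for exactly one case used above, and it parses the dates so that they are compared chronologically rather than as strings. The `Date_parsed` column is a hypothetical helper.

```python
import pandas as pd

# Made-up time series for two countries
ts = pd.DataFrame({
    "Country/Region": ["X"] * 4 + ["Y"] * 4,
    "Date": ["1/22/20", "1/23/20", "1/24/20", "1/25/20"] * 2,
    "Confirmed": [0, 2, 5, 9, 0, 0, 1, 3],
})

ts["Date_parsed"] = pd.to_datetime(ts["Date"], format="%m/%d/%y")

# Earliest date with at least one case, repeated on every row of the country
first_case = (
    ts["Date_parsed"].where(ts["Confirmed"] > 0)
      .groupby(ts["Country/Region"])
      .transform("min")
)

# Days before the first case are set to 0, as in setDay
ts["Day"] = (ts["Date_parsed"] - first_case).dt.days.clip(lower=0).fillna(0).astype(int)
print(ts[["Country/Region", "Date", "Confirmed", "Day"]])
```

Because each country's clock starts at its own first case, the Day column lets countries be compared at the same stage of the outbreak.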
#covid_time_series.head(50)
*** Replace NaN values with zero in Death_percentage and Recovered_percentage in covid_time_series.
covid_time_series.fillna(0, inplace=True)
#covid_time_series['Date'] = pd.to_datetime(covid_time_series['Date'], errors='coerce')
#covid_time_series['Day'] = covid_time_series['Date'].dt.day
#covid_time_series['Day']
*** Store the covid_time_series data in a new CSV file:
covid_time_series.to_csv('covid_time_series1.csv')
covid_time_series_I = covid_time_series[covid_time_series['Country/Region']=='India']
covid_time_series_I.head(5)
fig = px.scatter(full_data,x="total_cases",y="total_deaths",color='country',log_x=True ,log_y=True ,size_max=100, range_x=[1,1000000000],range_y=[1,1000000])
fig.update_traces(textposition='top center')
fig.update_layout(
# height=800,width=1000,
title_text='Total Deaths Cases in the world',xaxis = dict(
tickangle = 90,
title_text = "Total_cases",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Total_Deaths Cases",
title_standoff = 10)
)
fig.show()
-***This plot shows deaths versus total confirmed cases for each country on a logarithmic scale.
As the graph shows, the top five countries on the basis of total confirmed cases, total death rate, and recovered cases are
fig = px.scatter(full_data,x="total_cases",y="total_recovered",color='country', log_x=True ,log_y=True ,size_max=100, range_x=[1,10000000],range_y=[1,1000000])
fig.update_traces(textposition='top center')
fig.update_layout(
# height=800,width=1000,
title_text='Total Recovered Cases in the world',
xaxis = dict(
tickangle = 90,
title_text = "Total_cases",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Total_recovered Cases",
title_standoff = 10)
)
fig.show()
* This plot shows recovered cases for each country on a logarithmic scale. * All countries are included in the analysis, so points overlap in the scatter plot.
-The statsmodels module is used to analyse the COVID-19 data for all countries. It provides classes and functions for estimating regression models, conducting statistical tests, and exploring the data statistically, here applied to 'total_cases', 'total_deaths', 'new_cases', 'new_deaths', 'total_recovered' and 'active_cases' across all countries.
Assumptions of a regression model:
X = full_data[['total_deaths','new_cases','new_deaths']]
# #### Fit an OLS model with intercept: total_cases on total_deaths, new_cases and new_deaths.
y = full_data['total_cases']
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
print(est.summary())
*The coefficients β₀, β₁, β₂ and β₃ are chosen to minimize the sum of squared distances between the predicted values and the actual values.
To understand trends, we look at the slopes for total deaths, new cases and new deaths on a linear scale.
From our results, we see that:
• The intercept 𝛽̂0 = -111.27
The regression coefficient (coef) represents the change in the dependent variable resulting from a one-unit change in the predictor variable, all other variables being held constant. In our model, a one-unit increase in total_deaths, new_cases or new_deaths is associated with an increase in total_cases.
• The slope 𝛽̂1 = 5.0908 (total_deaths)
• The slope 𝛽̂2 = 11.7764 (new_cases)
• The slope 𝛽̂3 = 321.7351 (new_deaths)
• The positive 𝛽̂3 estimate means that countries reporting more new deaths tend to have more total cases. In line with our assumptions, an increase in total_deaths, new_cases or new_deaths is associated with an increase in total cases.
The p-value is the probability of observing a total_deaths coefficient at least as extreme as 5.0908 if there were no relationship between the two variables. It is reported as 0, that is, below the precision shown in the summary.
** The standard error measures the precision of the total_deaths coefficient by estimating how much the coefficient would vary if the same regression were run on a different sample. Our standard error, 1, is low relative to the coefficient, so the estimate appears precise.
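The quantities discussed above (coefficients, standard errors and the t statistics behind the p-values) can be reproduced by hand on synthetic data. The sketch below uses NumPy's least-squares solver rather than statsmodels, and the data are simulated with known coefficients (1, 2 and -3), so none of the numbers relate to the COVID-19 dataset.

```python
import numpy as np

# Simulated data with known coefficients: y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

# The column of ones plays the role of sm.add_constant
X = np.column_stack([np.ones(n), x1, x2])

# Least squares: the coefficients that minimise the sum of squared residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])   # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance of the estimates
std_err = np.sqrt(np.diag(cov))
t_stat = beta / std_err                     # large |t| corresponds to a small p-value

print("coef   :", beta.round(3))
print("std err:", std_err.round(3))
print("t      :", t_stat.round(2))
```

A coefficient is judged against its own standard error: the ratio of the two is the t statistic, and the p-value in the summary table is derived from it.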
fig = plt.figure(figsize =(15,8))
results = smf.ols('total_cases~total_deaths+new_cases+new_deaths',data = full_data).fit()
sm.graphics.plot_regress_exog(results, 'total_deaths', fig=fig)
plt.show()
2. The “Residuals versus total_deaths” graph shows our model's errors versus the specified predictor variable. Each dot is the residual for one observation; the horizontal line marks the mean of the residuals, which is zero for a model with an intercept.
3. The “Partial regression plot” shows the relationship between total_cases and total_deaths after accounting for the impact of the other independent variables on our existing total_deaths coefficient.
4. The Component and Component Plus Residual (CCPR) plot is an extension of the partial regression plot; it shows where our trend line would lie after adding the impact of the other independent variables on our existing total_deaths coefficient. This is the "component" part of the plot and is intended to show where the "fitted line" would lie.
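The idea behind the partial regression plot can be verified numerically: if both the response and the predictor of interest are first regressed on the other predictors, the slope between the two sets of residuals equals the coefficient from the full multiple regression. The sketch below uses simulated data and a hypothetical `ols` helper built on NumPy; it does not use the COVID-19 data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
z = rng.normal(size=n)                  # the "other" predictor
x = 0.6 * z + rng.normal(size=n)        # predictor of interest, correlated with z
y = 2.0 + 1.5 * x - 0.8 * z + rng.normal(size=n)


def ols(X, y):
    """Least-squares coefficients for y regressed on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


ones = np.ones(n)

# Full multiple regression: intercept, x, z
full = ols(np.column_stack([ones, x, z]), y)

# Remove the part of y and x that is explained by the other predictors
others = np.column_stack([ones, z])
y_res = y - others @ ols(others, y)
x_res = x - others @ ols(others, x)

# Slope of residuals on residuals: what the partial regression plot displays
partial_slope = (x_res @ y_res) / (x_res @ x_res)

print(full[1], partial_slope)
```

The two printed values agree up to floating-point error, which is why the partial regression plot is read as the effect of one predictor with the others held constant.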
X = full_data[['total_recovered','new_cases','active_cases']]
#### fit a OLS model with intercept on total_recovered and new_cases,active_cases.
y = full_data['total_cases']
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
print(est.summary())
*The coefficients β₀, β₁, β₂ and β₃ are chosen to minimize the sum of squared distances between the predicted values and the actual values.
To understand trends, we look at the slopes for recovered cases, new cases and active cases on a linear scale.
From our results, we see that:
• The intercept 𝛽̂0 = 3.0684
The regression coefficient (coef) represents the change in the dependent variable resulting from a one-unit change in the predictor variable, all other variables being held constant. In our model, a one-unit increase in total_recovered, new_cases or active_cases is associated with an increase in total_cases.
• The slope 𝛽̂1 = 1.0937 (total_recovered)
• The slope 𝛽̂2 = 0.1470 (new_cases)
• The slope 𝛽̂3 = 1.0588 (active_cases)
• The small positive 𝛽̂2 estimate implies a weak association with new_cases. In line with our assumptions, an increase in recovered or active cases is associated with an increase in total cases.
The p-value is the probability of observing a total_recovered coefficient at least as extreme as the estimate if there were no relationship between the two variables. It is reported as 0, that is, below the precision shown in the summary.
** The standard error measures the precision of the total_recovered coefficient by estimating how much the coefficient would vary if the same regression were run on a different sample. Our standard error, 0.007, is low relative to the coefficient, so the estimate appears precise.
fig = plt.figure(figsize =(15,8))
results = smf.ols('total_cases~total_recovered + new_cases+active_cases', data = full_data).fit()
#sm.graphics.plot_ccpr_grid(results, fig=fig)
sm.graphics.plot_regress_exog(results, 'total_recovered', fig=fig)
plt.show()
###endogenous: caused by factors within the system ,exogenous: caused by factors outside the system
2. The “Residuals versus total_recovered” graph shows our model's errors versus the specified predictor variable. Each dot is the residual for one observation; the horizontal line marks the mean of the residuals, which is zero for a model with an intercept.
3. The “Partial regression plot” shows the relationship between total_cases and total_recovered after accounting for the impact of the other independent variables on our existing total_recovered coefficient.
4. The Component and Component Plus Residual (CCPR) plot is an extension of the partial regression plot; it shows where our trend line would lie after adding the impact of the other independent variables on our existing total_recovered coefficient. This is the "component" part of the plot and is intended to show where the "fitted line" would lie.
fig = px.scatter(full_data, y='total_cases', x='active_cases',animation_frame="active_cases",text = "country",range_x =[0,100000],range_y=[0,100000])
fig.update_layout(
# height=800,width=1000,
title_text='Total Active_Cases in All countries',xaxis = dict(
tickangle = 90,
title_text = "Active_Cases",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Total_Cases",
title_standoff = 10)
)
fig.show()
fig2 = px.scatter(full_data, y='total_cases', x='serious', animation_frame="serious",text ='country',range_x =[0,10000],range_y=[0,100000])
fig2.update_layout(
#height=800,width=1000,
title_text='Total Serious Cases in the world',xaxis = dict(
tickangle = 90,
title_text = "Serious_cases",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Total_Cases",
title_standoff = 10)
)
fig2.show()
fig = px.scatter(full_data, x='tot\xa0cases/1m_pop', y='deaths/1m_pop', color='country',log_x=True ,log_y=True ,size_max=45)
fig.update_layout(
# height=800,width=1000,
title_text='Total Deaths Cases/ 1m_pop in the world',xaxis = dict(
tickangle = 90,
title_text = "Total_cases/1m_pop",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Total_Deaths Cases/1m_pop",
title_standoff = 10)
)
fig.show()
fig2 = px.scatter(full_data, x='tot\xa0cases/1m_pop', y='tests/_1m_pop', color='country',log_x=True ,log_y=True ,size_max=45)
fig2.update_layout(
#height=800,width=1000,
title_text='Total Test/1m_pop in the world',xaxis = dict(
tickangle = 90,
title_text = "Total_cases/1m_pop",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Total_Test/1m_pop",
title_standoff = 10)
)
fig2.show()
full_data.fillna(0, inplace=True)
full_data.head(10)
fig = px.scatter_mapbox(full_data, lat="lat", lon="long",color = 'country', hover_name="country", hover_data=['total_cases', "total_deaths"],
color_continuous_scale=px.colors.cyclical.IceFire,
animation_frame='total_cases',size_max=55, zoom=3)
fig.update_layout(title_text="Total_Cases in World")
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
fig = px.scatter_mapbox(full_data, lat="lat", lon="long",color = 'country', hover_name="country", hover_data=['total_cases',"total_deaths" ,"total_recovered"],
color_continuous_scale=px.colors.cyclical.IceFire,
animation_frame='total_deaths',zoom=3)
fig.update_layout(title_text="Total_Deaths in World")
fig.update_layout(
mapbox_style="white-bg",
mapbox_layers=[
{
"below": 'traces',
"sourcetype": "raster",
"source": [
"https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
]
}
])
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
full_data = full_data.sort_values(by=['tests/_1m_pop'])
full_data
full_data=pd.melt(full_data, id_vars=['country','tests/_1m_pop'], value_vars=['total_cases', 'new_cases', 'total_deaths', 'new_deaths', 'total_recovered'])
# plotly
fig = px.line(full_data, x='country', y='value', color='variable',log_x=False ,log_y=True)
fig.update_layout(
# height=800,width=1000,
title_text='Total Cases in the world',xaxis = dict(
tickangle = 90,
title_text = "Country",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Value",
title_standoff = 10)
)
# Show plot
fig.show()
base_color = sns.color_palette()[1]
plt.figure(figsize=(32,6))
g = sns.countplot(data = covid_india, x ='State/UnionTerritory', color = base_color)
g.set_xticklabels(g.get_xticklabels(), rotation=45)
g.set_title('Covid_19 Analysis Based on State/UnionTerritory')
n_points = covid_india.shape[0]
cat_counts = covid_india['State/UnionTerritory'].value_counts()
locs, labels = plt.xticks()
# Annotate each bar with its share of all records
for loc, label in zip(locs, labels):
    count = cat_counts[label.get_text()]
    pct_string = '{:0.1f}%'.format(100*count/n_points)
    plt.text(loc, count-8, pct_string, ha = 'left',va='bottom', color = 'black')
fig = px.scatter(covid_india,x="Confirmed",y="Deaths" ,animation_frame="Deaths", animation_group="State/UnionTerritory",color="State/UnionTerritory",log_x=True ,log_y=True , range_x=[1,10000],range_y=[1,10000])
fig.update_traces(textposition='top center')
fig.update_layout(
#height=800,width=1000,
title_text='Total Deaths Cases in India',xaxis = dict(
tickangle = 90,
title_text = "Total_cases",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Total_Deaths Cases",
title_standoff = 10),
)
fig.show()
fig = px.scatter(covid_india,x="Confirmed",y="Cured", animation_frame="Cured", animation_group="State/UnionTerritory",color="State/UnionTerritory", log_x=True ,log_y=True ,size_max=45, range_x=[1,10000],range_y=[1,10000])
fig.update_traces(textposition='top center')
fig.update_layout(
#height=800,width=1000,
title_text='Total Recovered Cases in India',xaxis = dict(
tickangle = 90,
title_text = "Total_cases",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Total Recovered Cases",
title_standoff = 10)
)
fig.show()
-The statsmodels module is used for the COVID-19 cases in India. It provides classes and functions for estimating regression models, conducting statistical tests, and exploring the data statistically, here applied to 'Confirmed', 'Deaths' and 'Cured' cases in India.
X = covid_india[['Deaths']]
#### fit a OLS model with intercept on Deaths
y = covid_india['Confirmed']
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
print(est.summary())
*The coefficients β₀ and β₁ are chosen to minimize the sum of squared distances between the predicted values and the actual values.
To understand trends, we look at the slope for deaths on a linear scale.
From our results, we see that:
• The intercept 𝛽̂0 = 113
The regression coefficient (coef) represents the change in the dependent variable resulting from a one-unit change in the predictor variable. In our model, a one-unit increase in Deaths is associated with an increase in Confirmed cases.
• The slope 𝛽̂1 = 22.66
In line with our assumptions, an increase in deaths is associated with an increase in confirmed cases.
The p-value is the probability of observing a Deaths coefficient at least as extreme as 22.66 if there were no relationship between the two variables. It is reported as 0, that is, below the precision shown in the summary.
** The standard error measures the precision of the Deaths coefficient by estimating how much the coefficient would vary if the same regression were run on a different sample. Our standard error, 0.195, is low relative to the coefficient, so the estimate appears precise.
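With a single predictor, the least-squares intercept and slope have a closed form: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept makes the line pass through the two means. The sketch below works this out with the standard library only; the (deaths, confirmed) pairs are made up and are not taken from covid_19_india.csv.

```python
from statistics import mean

# Made-up (deaths, confirmed) pairs for illustration
deaths = [0, 1, 2, 4, 7]
confirmed = [10, 30, 52, 95, 160]

x_bar, y_bar = mean(deaths), mean(confirmed)

# Sum of cross-deviations and sum of squared deviations of the predictor
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(deaths, confirmed))
sxx = sum((x - x_bar) ** 2 for x in deaths)

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residuals of the fitted line; with an intercept they sum to zero
residuals = [y - (intercept + slope * x) for x, y in zip(deaths, confirmed)]

print(f"Confirmed = {intercept:.2f} + {slope:.2f} * Deaths")
```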
fig = plt.figure(figsize =(15,8))
#full_data1= sm.dataset.full_data.load_pandas()
results = smf.ols('Confirmed ~Deaths ', data = covid_india).fit()
#sm.graphics.plot_ccpr_grid(results, fig=fig)
sm.graphics.plot_regress_exog(results, 'Deaths', fig=fig)
plt.show()
2. The “Residuals versus Deaths” graph shows our model's errors versus the specified predictor variable. Each dot is the residual for one observation; the horizontal line marks the mean of the residuals, which is zero for a model with an intercept.
3. The “Partial regression plot” shows the relationship between Confirmed cases and Deaths after accounting for the impact of any other independent variables on our existing Deaths coefficient. This model has a single predictor, so there are no other variables to adjust for.
4. The Component and Component Plus Residual (CCPR) plot is an extension of the partial regression plot; it shows where our trend line would lie after adding the impact of any other independent variables on our existing Deaths coefficient. This is the "component" part of the plot and is intended to show where the "fitted line" would lie.
X = covid_india[['Cured']]
#### fit a OLS model with intercept on Cured Cases
y = covid_india['Confirmed']
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
print(est.summary())
*The coefficients β₀ and β₁ are chosen to minimize the sum of squared distances between the predicted values and the actual values.
To understand trends, we look at the slope for cured cases on a linear scale.
From our results, we see that:
• The intercept 𝛽̂0 = 42.61
The regression coefficient (coef) represents the change in the dependent variable resulting from a one-unit change in the predictor variable. In our model, a one-unit increase in Cured is associated with an increase in Confirmed.
• The slope 𝛽̂1 = 4.21
In line with our assumptions, an increase in cured cases is associated with an increase in confirmed cases.
The p-value is the probability of observing a Cured coefficient at least as extreme as 4.21 if there were no relationship between the two variables. It is reported as 0, that is, below the precision shown in the summary.
** The standard error measures the precision of the Cured coefficient by estimating how much the coefficient would vary if the same regression were run on a different sample. Our standard error, 0.050, is low relative to the coefficient, so the estimate appears precise.
fig = plt.figure(figsize =(15,8))
#full_data1= sm.dataset.full_data.load_pandas()
results = smf.ols('Confirmed ~Cured', data = covid_india).fit()
#sm.graphics.plot_ccpr_grid(results, fig=fig)
sm.graphics.plot_regress_exog(results, 'Cured', fig=fig)
plt.show()
2. The “Residuals versus Cured” graph shows our model's errors versus the specified predictor variable. Each dot is the residual for one observation; the horizontal line marks the mean of the residuals, which is zero for a model with an intercept.
3. The “Partial regression plot” shows the relationship between Confirmed cases and Cured after accounting for the impact of any other independent variables on our existing Cured coefficient. This model has a single predictor, so there are no other variables to adjust for.
4. The Component and Component Plus Residual (CCPR) plot is an extension of the partial regression plot; it shows where our trend line would lie after adding the impact of any other independent variables on our existing Cured coefficient. This is the "component" part of the plot and is intended to show where the "fitted line" would lie.
f,g = plt.subplots(figsize = (15,10))
base_color = sns.color_palette()[0]
g = sns.barplot(data = covid_india, x = 'Confirmed', y = 'State/UnionTerritory',
label = 'Total Confirmed Cases',color = base_color)
sns.set_color_codes('muted')
g = sns.barplot(x = 'Cured', y = 'State/UnionTerritory', data = covid_india,
                label = 'Total number of Cured', color = 'r', edgecolor = 'w')  # single-letter colour codes must be lowercase
sns.set_color_codes('pastel')
g= sns.barplot(x = 'Deaths', y = 'State/UnionTerritory', data = covid_india,
label = 'Total number of Deaths', color = 'g', edgecolor = 'w')
g.set_title('Analysis of Covid_19 Cases in India')
g.legend(ncol = 3, loc = 'lower right')
#sns.despine(left = True, bottom = True)
plt.show()
# Pass figsize to subplots directly; a separate plt.figure call would create an unused empty figure
fig, ax = plt.subplots(1, 2, figsize=(20, 15))
g =sns.scatterplot(data=covid_india,y="Deaths",x="Confirmed",ax=ax[0],hue="State/UnionTerritory",palette="deep")
g.legend(loc='upper right', bbox_to_anchor=(.20, 0.0), ncol=1)
g =sns.scatterplot(data=covid_india,y="Cured",x="Confirmed",ax=ax[1],hue="State/UnionTerritory",palette="deep")
g.set_title('Covid_19 Deaths Cases and Recovered Cases in India and its State/ Union Territory')
g.legend(loc='upper left', bbox_to_anchor=(1.0, 0.0), ncol=1)
fig = px.bar(covid_india_age, y='TotalCases', x='AgeGroup', text='Percentage')
fig.update_traces(texttemplate='%{text}', textposition='outside',marker_color='lightsalmon')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.update_layout(title_text="Analysis Based on Age Group in India")
fig.show()
s = covid_india_state_testing.sum()
s
fig = px.scatter(covid_india_state_testing, x='Date', y='Positive', title='Positive Cases Time Series with Rangeslider', range_x=['2020-01-23','2020-05-04'],range_y = [1,100000])
fig.update_traces(marker_color='indianred')
fig.update_xaxes(rangeslider_visible=True)
fig.update_layout(
# height=800,width=1000,
title_text='Covid_19 Testing in India',xaxis = dict(
tickangle = 90,
title_text = "Date",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Positive",
title_standoff = 10),
)
fig.show()
fig1 = px.scatter(covid_india_state_testing, x='Date', y='Negative', title='Negative Cases Time Series with Rangeslider',range_x = ['2020-01-23','2020-05-04'],range_y = [1,1000000])
fig1.update_traces(marker_color='lightsalmon')
fig1.update_xaxes(rangeslider_visible=True)
fig1.update_layout(
# height=800,width=1000,
title_text='Covid_19 Testing in India',xaxis = dict(
tickangle = 90,
title_text = "Date",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Negative",
title_standoff = 10),
)
fig1.show()
Date: 22 Jan 2020 to 6 May 2020
covid_time_series.head(10)
fig = px.scatter(covid_time_series_C, x='Date', y='Confirmed',color="Country/Region", title='Confirmed Cases Time Series with Rangeslider',range_x=['1/22/20','4/29/20'],range_y = [1,1500000])
#fig.update_traces(marker_color='darkslateblue')
fig.update_layout(
# height=800,width=1000,
title_text= 'Datewise Confirmed Cases in all Countries',xaxis = dict(
tickangle = 90,
title_text = "Date",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Confirmed Cases",
title_standoff = 10),
)
fig.update_xaxes(rangeslider_visible=True)
fig.show()
fig = px.scatter(covid_time_series_D, x='Date', y='Deaths',color="Country/Region", title='Deaths Cases Time Series with Rangeslider',range_x=['1/22/20','4/29/20'],range_y = [1,100000])
#fig.update_traces(marker_color='sandybrown')
fig.update_layout(
# height=800,width=1000,
title_text= 'Datewise Deaths Cases in all Countries',xaxis = dict(
tickangle = 90,
title_text = "Date",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = " Deaths Cases",
title_standoff = 10),
)
fig.update_xaxes(rangeslider_visible=True)
fig.show()
fig = px.scatter(covid_time_series_covid_19_R, x='Date', y='Recovered',color="Country/Region", title='Recovered Cases Time Series with Rangeslider',range_x=['1/22/20','4/29/20'],range_y = [1,1000000])
#fig.update_traces(marker_color='green')
fig.update_layout(
#height=800,width=1000,
title_text= 'Datewise Recovered Cases in all Countries',xaxis = dict(
tickangle = 90,
title_text = "Date",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Recovered Cases",
title_standoff = 10),
)
fig.update_xaxes(rangeslider_visible=True)
fig.show()
Date: 22 Jan 2020 to 6 May 2020
fig1 = px.scatter(covid_time_series_I, x='Date', y='Confirmed',color="Country/Region", title='Confirmed Cases Time Series with Rangeslider',range_x=['1/22/20','4/29/20'],range_y = [1,10000])
fig1.update_layout(
# height=800,width=1000,
title_text= 'Datewise Confirmed Cases in India',xaxis = dict(
tickangle = 90,
title_text = "Date",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Confirmed Cases",
title_standoff = 10),
)
fig1.update_xaxes(rangeslider_visible=True)
fig1.show()
fig2 = px.scatter(covid_time_series_I, x='Date', y='Deaths',color="Country/Region",range_x=['1/22/20','4/29/20'],range_y = [1,10000])
fig2.update_layout(
#height=800,width=1000,
title_text= 'Datewise Deaths Cases in India ',xaxis = dict(
tickangle = 90,
title_text = "Date",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Deaths Cases",
title_standoff = 10),
)
fig2.update_xaxes(rangeslider_visible=True)
fig2.show()
fig3 = px.scatter(covid_time_series_I, x='Date', y='Recovered',color="Country/Region", title='Recovered Cases Time Series with Rangeslider',range_x=['1/22/20','4/29/20'],range_y = [1,10000])
fig3.update_layout(
#height=800,width=1000,
title_text= 'Datewise Recovered Cases in India ',xaxis = dict(
tickangle = 90,
title_text = "Date",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Recovered Cases",
title_standoff = 10),
)
fig3.update_xaxes(rangeslider_visible=True)
fig3.show()
Countries with the highest numbers of confirmed cases, deaths, and recovered cases on a particular day.
fig = px.scatter(covid_time_series, x='Day', y='Confirmed',color="Country/Region",log_x=False ,log_y=True )
fig.update_layout(
#height=800,width=1000,
title_text= 'Daywise Confirmed Cases in All Countries ',xaxis = dict(
tickangle = 90,
title_text = "Day",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Confirmed Cases",
title_standoff = 10),
)
fig.show()
fig1 = px.scatter(covid_time_series, x='Day', y='Deaths',color="Country/Region",log_x=False ,log_y=True )
fig1.update_layout(
#height=800,width=1000,
title_text= 'Daywise Deaths Cases in All Countries ',xaxis = dict(
tickangle = 90,
title_text = "Day",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Deaths Cases",
title_standoff = 10),
)
fig1.show()
fig2 = px.scatter(covid_time_series, x='Day', y='Recovered',color="Country/Region",log_x=False ,log_y=True )
fig2.update_layout(
#height=800,width=1000,
title_text= 'Daywise Recovered Cases in All Countries ',xaxis = dict(
tickangle = 90,
title_text = "Day",
title_font = {"size": 15},
title_standoff = 10),
yaxis = dict(
title_text = "Recovered Cases",
title_standoff = 10),
)
fig2.show()
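The daywise plots above show each country's trajectory; the day on which each country reached its highest value can also be tabulated directly. The following is a minimal sketch under stated assumptions: the small `sample` frame and the helper `peak_day` are hypothetical, and only the column names `Country/Region`, `Day`, and `Confirmed` are taken from the plots. If the column holds cumulative totals, the peak is simply the latest day, so daily differences would need to be taken first.

```python
import pandas as pd

# Hypothetical miniature of a frame shaped like covid_time_series.
sample = pd.DataFrame({
    "Country/Region": ["India", "India", "India", "Italy", "Italy", "Italy"],
    "Day": [1, 2, 3, 1, 2, 3],
    "Confirmed": [5, 12, 9, 40, 35, 60],
})


def peak_day(df, value_col, group_col="Country/Region"):
    """Return one row per group: the day on which `value_col` was highest."""
    idx = df.groupby(group_col)[value_col].idxmax()   # row label of each group's maximum
    return df.loc[idx, [group_col, "Day", value_col]].reset_index(drop=True)


peaks = peak_day(sample, "Confirmed")
print(peaks)
```

The same call with `"Deaths"` or `"Recovered"` as `value_col` would give the corresponding peak days, provided those columns exist in the frame.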
- The first cases of COVID-19 were reported in Wuhan, the city in central China where the virus originated, which has a population of over 11 million people. The city was placed under lockdown on January 23.
- The current official estimate of the incubation period for COVID-19 is 2-14 days.
- In January, few novel coronavirus cases were reported in the UK, Russia, Sweden, and Spain.
- In March and April, confirmed cases and deaths increased sharply all over the world.
** Total confirmed cases in the world = 3,913,644
** Total deaths in the world = 270,426
** Total recovered cases in the world = 1,341,022
** Total confirmed cases in India = 56,342
** Total deaths in India = 1,886
** Total recovered cases in India = 16,540
(* current figures as reported at the time of this conclusion)
The time series data show that confirmed cases are increasing rapidly and that the death rate is also rising across countries.
The time series data also show that recovered cases outnumber deaths across countries.
"Confirmed" refers to a case that has been reported, not to every case of infection. Day-to-day fluctuations are therefore expected, owing to the sampling bias induced by the limited supply of coronavirus test kits.
- As a result, many cases that do not yet show symptoms are not tested. The mortality rate is expressed in deaths per million inhabitants.
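The difference between a rate based on reported cases and a rate per million inhabitants can be made concrete with the world totals listed above. This is only a sketch: the function names are hypothetical, and the world population of 7.8 billion is a round-number assumption that is not part of the dataset.

```python
def case_fatality_rate(deaths, confirmed):
    """Deaths as a percentage of reported (confirmed) cases."""
    return 100.0 * deaths / confirmed


def deaths_per_million(deaths, population):
    """Deaths per one million inhabitants."""
    return deaths / population * 1_000_000


# World totals quoted in the conclusion above.
WORLD_CONFIRMED = 3_913_644
WORLD_DEATHS = 270_426
# Assumption: rough 2020 world population, not taken from the dataset.
ASSUMED_WORLD_POPULATION = 7_800_000_000

cfr = case_fatality_rate(WORLD_DEATHS, WORLD_CONFIRMED)
dpm = deaths_per_million(WORLD_DEATHS, ASSUMED_WORLD_POPULATION)
print(f"Case fatality rate among reported cases: {cfr:.2f}%")
print(f"Deaths per million inhabitants: {dpm:.1f}")
```

Because the first rate divides by reported cases only, it is sensitive to how many people are tested; the per-million figure does not depend on testing volume, although it still depends on deaths being recorded.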
def hide_code_in_slideshow():
    # Inject a small script that tags the enclosing cell's input area so it can be hidden in slides.
    from IPython import display
    import binascii
    import os
    uid = binascii.hexlify(os.urandom(8)).decode()
    html = """<div id="%s"></div>
    <script type="text/javascript">
        $(function(){
            var p = $("#%s");
            if (p.length==0) return;
            while (!p.hasClass("cell")) {
                p=p.parent();
                if (p.prop("tagName") =="body") return;
            }
            var cell = p;
            cell.find(".input").addClass("hide-in-slideshow")
        });
    </script>""" % (uid, uid)
    display.display_html(html, raw=True)
hide_code_in_slideshow()
#jupyter nbconvert presentation.ipynb --to slides --template output-toggle.tpl
#jupyter nbconvert Jupyter\ "covid analysis-final.ipynb" --to slides --template output-toggle.tpl --post serve